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Abstract 

The parameters of a discrete stationary Markov model are transition probabilities between states. 
Traditionally, data consist in sequences of observed states for a given number of individuals over 
the whole observation period. In such a case, the estimation of transition probabilities is straight- 
forwardly made by counting one-step moves from a given state to another. In many real-life 
problems, however, the inference is much more difficult as state sequences are not fully observed, 
namely the state of each individual is known only for some given values of the time variable. A 
review of the problem is given, focusing on Monte Carlo Markov Chain (MCMC) algorithms to 
perform Bayesian inference and evaluate posterior distributions of the transition probabilities in 
this missing-data framework. Leaning on the dependence between the rows of the transition ma- 
trix, an adaptive MCMC mechanism accelerating the classical Metropolis-Hastings algorithm is 
then proposed and empirically studied. 

Keywords: Bayesian inference, industrial reliability, missing data, Markov models, adaptive 
MCMC, Gaussian copulas. 



1. Introduction 

When facing situations where a variable of interest Z takes time-dependent values within a 
finite (discrete) set S = {si,S2, . . . ,Sr} of r classes (let us call them states), a decision-maker is 
most often interested in estimating the probability p^(t) for Z to be in a given set of states ^ C S 
as a function of time t. For instance, these states can correspond to failure states in rehabihty 
analysis, thus 1 — J5^(t) is the reliability function of the system under investigation E at time t. 
Another function of interest could be the expected number of states Nj^ before S reaches A. To do 
so, the decision-maker first has to estimate the transition probabilities from one state to another, 
i.e. estimate the transition matrix ip. The vector p(0) of the initial probabilities pi(0), . . .pr{0) 
for the system to be in states Si, . . . respectively, at t — 0, is usually assumed known in real- life 
applications; therefore, the knowledge of the transition matrix tj) allows to evaluate, for a given 
time t, the probabilities for being in each of the r states, i.e. the vector: 

pit)^p{0)-ip\ (1) 

Given some data z under the form of observed sequences of states, the statistical estimation of 
these probabilities is traditionally facilitated by a time-homogeneous, first-order Markov station- 
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arity assumption about the process A which generates the data. In other words, the transition 
probabihty ipij from any state to any other state sj {i possibly equals to j) is assumed to be 



independent of time and of the past trajectories before reaching Sj. As noted by Jones (2005), this 
assumption can seem somewhat restrictive regarding the external knowledge about the process, 
but "using (possibly more appropriate) higher-order processes increases the complexity and data 
requirements quite substantially, and may not be feasible with only a limited time series" , which is 
often the case in practice. Such situations are encountered in a wide range of application domains, 
as we discuss now. 

To start with, the applications of Markov models in engineering practice are numerous. In 
reliability engineering and safety analysis, discrete Markov schemes, the states of which correspond 
to gradually degraded operating conditions, have for instance been used to assess the rehabihty 



of programmable electronic systems (Bukowski and Globe 1995), cogeneration plants (El-Nashar 



2008 ) , machineries of oil refineries ( Cochran et al. , 2001 ) , water meters ( Pasanisi and Parent , 2004 ) , 



piping systems of power plants (Cronvall and Mannisto, 2009) and welded structures submitted to 



fatigue damage (Lassen, 1991). In this paper, the example treated in Section 2.5 provides a real 
industrial use-case where a Markov model is used to assess the reliability of rotating machines. 
Examples of application in hydrology and water resources engineering concern the modeling of river 
inflows (Parent et al. , 1991), lake inflows (Duckstein and Bogardi, 1979), water supply reservoir 



states (Vogel, 1987) or pollutants propagations (Ang and Tang, 1984 Zhang and Dai, 2007). In 



biomedical survey, Markov chains can model the health condition of patients affected by infectious 
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population. As a last example, the financial world makes a wide use of first-order Markov transition 



matrices to explain the migration of credit ratings (Jones, 2005 Fuertes and Kalotychou, 2007) or 



to model loan defaults (Grimshaw and Alexander 2011). 



In an ideal framework, the data z consist in m time series of observed states for m identical 
individuals (systems) S that are assumed independent. If no data is missing, the estimation of ij) 
is relatively straightforward. In many applied problems, however, part of data is missing. Such 
problems can often be divided in two classes. 

(i) We call an incomplete sequence problem the estimation of if) when z are observed trajectories 
of states: 



^(1,1) 



Z{2,2) 



Z{l,T-l) 
Z(2,T-1) 



Z(2,T) 



Z{Tn,l) 



Z{m,T~l) 



containing random missing items (random successions of unknown states symbolized by "•"), 
assuming the initial state is known. This occurs typically when the m individuals are checked 
at deterministic times t = 1, . . . ,T, independently from A, as noted by Dupuis ( 1995 ) or when 
the survey of all individuals at the same time is impossible (e.g. only a given proportion of 
the machineries can be inspected simultaneously, in order to avoid stopping the industrial 
production) . 

(ii) We call an aggregate data problem the estimation of i/; when the sequential data z are 
reduced to the numbers of individuals Uiit) being in a given state Sj at a given time t 
(i.e. ni{t) = Yl^Li l{z(j t)='5»})- Such data are frequently (Gouno et al. , 2011) the only ones 
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being at disposal of the analyst, because, for instance, the full trajectories of individuals 
represented too much information or were not considered of primary importance during the 
survey process. 

Our first aim in this paper is to give in Section [2] a short review of the main computational 
methods dedicated to the estimation task in the complete data scheme, as well as in both missing 
data schemes described above, in a Bayesian statistical framework that we defend as being the 



most convenient. Monte Carlo Markov Chains (MCMC) (Robert and Casella, 2004) are used 
to deal with such missing data schemes. More specifically, we address ourselves to the reader 
who is interested in reducing the computational cost, which appears as a practical difficulty when 
the number of degrees of freedom is high and/or when the Markov model is part of a larger, 
encompassing model. Simulated experiments help us give some guidelines to select a variant of 
these methods in the common case where, for each individual, the state at just one time step has 
been observed. 

In Section|3]we consider the time-consuming aspects of numerical computation. Two adaptation 
mechanisms, which explore the correlations between the transition probabilities, are proposed to 
accelerate the convergence of the Metropolis-Hastings algorithm towards the posterior distribution. 
A series of numerical experiments based on a class of transition matrices largely encountered in 
reliability analysis is led in Section |4j Although they remain relatively basic, we show that our 
proposals can help reduce the computation time significantly. Finally, a discussion section sums 
up the main results and advices arising from both main parts of the paper, and highlights several 
promising research avenues, especially in the theoretical description of adaptive MCMC. 



2. Bayesian estimation of transition probabilities: a review 

This section provides a review of Bayesian inference techniques for the estimation of the tran- 
sition matrix tj) under the obvious conditions: 



< < 1 



1. 



(2) 



This estimation problem has thus r(r — 1) degrees of freedom. As stated hereinbefore, we volun- 



tarily chose a Bayesian viewpoint. Besides the more theoretical issues pointed by Robert (2001), 



we motivate our choice, in an industrial context, by the possibility to explicitly (and relatively 



easily) quantify, via predictive simulation (Girard and Parent, 2004), the uncertainty affecting 



some quantities of practical interest for the reliability engineer (e.g. the probability for the system 
to be in a failure state for a given time t, or the mean time before the system reaches one of the 
failure states). Moreover, from a strictly computational point of view, the Bayesian framework 
allows here to deal with some issues that can be quite burdensome in frequentist inference, without 
any particular additional difficulty. These include the intractability of the likelihood expression in 
missing data schemes, the respect of constraints (|2]), the difficulty to obtain a probabil ity distribu - 
tion for the estimators i/?, which requires using (possibly costly) bootstrap approaches (Fuh, 1993). 
Besides, the validity of such distributions remains usually asymptotic. Finally, even if this point 
has not been investigated, using an informative prior could maybe solve some identifiability prob- 



(Puolamaki and Kaski, 2009). 



lems (AUman et al. , 2009), when the dimension of is high and/or data are poorly informative 
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A convenient prior for the transition matrix tj) can be obtained as the product of r independent 
Dirichlet distributions, one for each row of 



As the Dirichlet density is null outside the standard (r — l)-simplex, it is particularly suited as 
a prior distribution of probabilities vectors, that must fulfil conditions ([2|. Another well-known 
rationale for choosing a Dirichlet prior is that it can be seen as the reference posterior for a 
multinomial parameter given some virtual data of state-occupancy, whose sizes jij — 1 can be 



interpreted as measures of the prior's strength (Minka, 2003). However, in absence of precise 
expert opinion in the remainder of this paper, uniform priors {jij = 1, Vi,j) were used for the 



examples shown hereinafter, following the recent recommendations from Tuyl et al. (2009) based 
on symmetry requirements of posterior predictive distributions. 

2.1. Complete sequence problem 

Transition probabilities estimation can easily be performed when complete states time-series 
(often alternatively called panel data) are available for the m individuals. The estimation is based 
on the calculation, for every couple of states (sj, sj), of the number of observed one-step transitions 
from state Si to state sj: 



yy^f V (4) 



t=l k=l 



Full data likelihood can be written as a function of the sufficient statistics Wij by observing that 
conditional on the row vector t/>^ = (V'i,i---V'i,r); the vector Wi = {wi^i...Wi^r) is multinomial with 
parameters if^i and X]j=i ^jj- Therefore, the likelihood L{z\il)) can be written as the product of 
r multinomial terms: 

i=l ^ ' ' ' 

In a Bayesian framework, the estimation of transition probabilities given complete sequences is 
straightforward. The inference problem consists in computing the posterior probability distribution 
of model parameters 7r('0|2) by updating the prior distribution 7r(i/;) conditional to the observed 
data z, through the Bayes formula: 

L(z|V;)7r(V>) 
j^L(2|'0)7r(V') dil^ 

where VL denotes the set of all possible values of i/?. From ([6]), it can be seen that the prior ([s]) 
of ij) is conjugate, i.e. the posterior distributions of the i/^j's are also Dirichlet distributions, with 
parameter vectors equal to (74,1-1-^4^1, . . . , 7i,r + Wj,r). This is the well known Dirichlet-multinomial 
model. 
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2.2. Incomplete sequence problem, ignorable DCM 

In the most general case of incomplete sequences problem, the estimation problem turns out 
to be more complicated. Throughout this study, we will mostly consider the case where the Data 
Collection Mechanism (DCM) is ignorable, which means, in practice, that it can be neglected in 
the statistical data analysis. Besides simplicity purposes, this choice is essentially motivated by 
the framework and the background of our study, which is reliability analysis. Some elements about 
the more general cases of non-ignorable DCM will be provided in the next section. 

Let X(fc^t) be an auxiliary binary variable (missingness indicator) which is one if the observation 
is missing, zero if the state has been observed. The DCM is described by a complementary 
statistical model specifying P |2, z^i^, v) ? i-e- the probability for an observation to be missing, 
depending on observed and unobserved data and (possibly) some other parameters r/. 

Fulfilling two conditions is sufficient for ignorability (Gelman et al. , 2004[ ): the first one states 
the independence between the parameters of the DCM and the main model (here t] and if) re- 
spectively), the second one asserts that the probability that an observation is missing does not 
depend on missing data (MAR: missing at random condition). The first condition is generally 
easily checked, while the second one highly depends on the context of the statistical study. For 
instance, in capture-recapture experiments the probability of recapture may depend or not on the 
state of the individual (e.g. younger animals can be more easily captured than older ones). In 
longitudinal medical surveys the health state of a patient can prevent him from going to a pe- 
riodical visit (e.g. in case he/she is hospitalized). In an industrial reliability framework, and in 
particular in the specific context of EDF (Electricite de France), the presence of missing data is 
mainly due to the impossibility of simultaneously surveying the whole population of components 
for cost or system availability reasons. This motivates our choice to mainly focus on ignorable 
DCM situations. 

Let us now come back to our estimation problem. In incomplete sequences problems, the 
likelihood has a highly complex expression. It is the product of m terms which are the probabilities 
to observe each one of the m sequences. Whilst writing the term related to an incomplete sequence, 
one must consider all possible values of the unknown observations. For example, the probability of 
the sequence (si, si, •, •, S3) must be written by taking into account all possible three-steps paths 
from state Si to state S3: 



P(si, Si, 



S3) oc J2 



1=1 



j,3 



Estimation methods dealing directly with the likelihood expression may be quite tricky to perform 
(Deltout et al. , 1999). On the other hand, Bayesian inference can elegantly be performed by means 



of a Gibbs sampler. 

This procedure is particularly adapted to the cases where the posterior distribution of model 
parameters would be more easily determined if data were fully observed. Missing data are consid- 
ered as additional model parameters Zja±s{k,t) and, within the Gibbs sampling, an additional step 
is performed to simulate them, thus completing the data set. This technique is usually known 



as data augmentation (Robert and Casella, 2004). Note that Gibbs sampling may be viewed as 



the Bayesian mirror of Stochastic Expectation-Maximization (SEM) algorithms based on a similar 



mechanism (Deltout et al. 1999). 



In our case the augmented data set, 
individuals: 



say y, is the set of the completed state sequences for all 



y{k,t) = Z(k,t) if Z(^k,t) is observed. 



5 



and 



y(k,t) 



'niis(fc,t) 



otherwise. 



The Gibbs sampler algorithm for the incomplete sequence problem can be viewed as a particular 



case of the more general method for the Arnason-Schwarz capture-recapture model (Marin and 



Robert, 2007). We first initialize the algorithm by arbitrarily completing state sequences. Then 



at each step = 1, 2, . . ., we perform the following two-step procedure: 
1. drawing new parameter values, conditional on the augmented data y 



[h-i]. 



Wi \V 



I'^-y ~ Dir (7,1 + ^£ 



[h-i\ 



where vof^ are the sufficient statistics (4) evaluated from current completed sequences 

2. drawing missing data z"^}^^^^^^ conditional to the current values i/?'''' of model's parameters 
(data augmentation step). This can be done by sampling from a conditional categorical 
distribution defined by the following probabilities: 



P 



I \h-^ 



cx ijj^Jj , for t 



^{yt^T) = SJ\ytk,T-l) = s^,i^^'^) OC ^!3,fort 



T, 



and 



P 



y{k,t) 



t-i) 



Hi ) 



y{k,t+i) 



OC 'ifjf^^j ■ ipf^j^ , otherwise. 



The computational method shown above is quite general and easy to implement. On the other 
hand, the more incomplete the sequences are, the more additional parameters are required and the 
more the data augmentation step becomes time-consuming. This issue will be illustrated later on 



in the example of Section 2.6 A technique to accelerate this step, consisting in simulating blocks 



of consecutive missing data instead of one datum at a time, is proposed by Dupuis and Schwartz 



(2007) 



A particularly interesting case of incomplete sequence problem occurs when each individual is 
observed just once over the observation period. This can happen in industrial reliability when the 



data come from the first survey of operating machines, as in the real-world example of Section 2.5 



or from destructive controls (Pasanisi and Parent, 2004). Then let (with 1 < < T) be the 
time when the individual k has been observed and Sj be the observed state. The state sequences 
takes the form: 



In that case, it can be shown (proof in Appendix A) that the likelihood L (z\tp) has the general 
expression: 



t=i j=i 



(7) 
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In the formula above, Pj{t) is the unconditional probability for the system to be in state sj at time 
t and n'j(t) = 1{2(, t)=sj} is the number of times the state Sj has been observed at time t in 
the data sample z. It has to be noticed that the expression of the likelihood depends on sufficient 
statistics n'j{t) and the statistical problem is equivalent to the aggregate data problem considered 
hereinafter. In this particular case, Bayesian estimation can be performed using the Gibbs sampler 
described above or the Metropolis-Hastings procedure we carry out for the aggregate data problem 
m Section 1231 

2.3. Incomplete sequence problem, non-ignorable DCM 

Let us now consider the more general case where DCM is non ignorable. 



This problem has been studied in detail (Little and Rubin, 1987, Chapters 6-10) in particular 



within the framework of longitudinal medical surveys: indeed, for different reasons, patients can 
leave the study permanently (dropout) or temporarily (intermittent missing). Using the same 
notation as in the previous subsection, let yk be a complete data sequence for the individual k 
(while Zfc denotes the actually observed sequence). The different ways for coping with MNAR 
(missing not at random) observations rely, from a technical point of view, on the way the full-data 
likelihood L[yk,Xk\il^,r]) is factorized. Three types of factorization are usually proposed: 

L{yk\xki'ik) ■ L{xk\r]) (pattern mixture model). 



L{yk\i/}) ■ L{xk\yk, rj) (selection model). 



and 



L{yk\xk,Vk,ip) ■ L{xk\vk,r]) ■ f{vk\X) dvk (shared parameter model). 



The formulations above can be complexified, by considering the infiuence of covariates in both 
the main and the missingness models. 



In the pattern mixture framework (Little , 1993 ), the analyst models the conditional distribution 
of the observable outcome, given its observation pattern, and the distribution of the different 
patterns. As a matter of fact, the data are stratified (each pattern determines a stratum) and the 
main parameters i/' are estimated in each stratum. 



The selection factorization, first introduced by Rubin (1976), instead, focuses on the dependence 



between the missingness and the actual value of the observable variable (in our case the state of the 
individual). This scheme explicitly copes with the distribution of the complete data y conditional 
on the main parameter of the model, here if). The DCM parameters t] are easy to interpret and 
provide additional valuable information to the analyst. 

In the shared-parameter scheme (Wu and Carroll, 1988), the missing mechanism is indirectly 



related to the observable variable through a latent variable v, depending on some additional 
parameters A. 

In the particular framework of the estimation of transition probabilities 



Cole et al. (2005) 



considered categorical quality-of-life data in cancer clinical trials, using a selection factorization. 
Transition probabilities ipij and missingness probabilities rji = P{xk,t = l|z/(fc,t) = Si) both depend 
on observable covariates. 



The Arnason-Schwarz model (Dupuis, 1995 Marin and Robert, 2007), also based on a selection 



factorization, provides an elegant Bayesian solution in the case where the T]iS do not depend on 
covariates. In this case, a natural choice of the prior for each one of the ?7j's is a Beta pdf: Be(aj, /3i). 
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The Gibbs algorithm for estimating the posterior of [r], is straightforward as, conditional on the 
the complete data y, both posterior distributions of r] and i/? are explicit. The detailed description 
of the two steps of the algorithm (data augmentation and parameters estimation) is given in 



Appendix B 



2.4- Aggregate data problem 

In many real-life problems, we do not follow individuals passing from state to state and the 
only available data for estimating transition probabilities are aggregate data n, i.e. the number 
of individuals ni{t) being a given time t. Any track of individual trajectories 

is lost. That may occur in practice when a population of m individuals has been followed over 
an observation period but the original aim of the survey was simply having the fractions of the 
population in particular states. State sequences have thus been considered as raw data and dis- 



carded. Examples in sociology and population dynamics were highlighted by Bartholomew (1973) 



(2005) 



and Pollard (1973), among others. Applications in credit rating were recently studied by 



Jones 



Conditional on the proba- 
nd (t)) is multinomial with 



The inference problem has been formalized by Lee et al. (1968) 
bility vector p{t) = p{0) ■ the data vector n{t) = (ni(t), n2(t), . 
parameters p{t) and X]j=i%(^)- The likelihood L (nli/?) can then be written as the product of T 
independent terms: 



L (n|V) = n V WvAtT'^'^- 



(8) 



t=i n ^jW- 



Lee et al. (1968) focused on obtaining point estimates of the matrix i/? and in particular the 
posterior mode of 7r(t/j|n) by maximizing the product of the likelihood ([s]) and r independent 
Dirichlet priors (|3|, one for each row of i/?. 



In the same frequentist context, MacRae (1977) then Kalbfleisch and Lawless (1984) were 



among the main authors who developed generalized least square estimators to remedy the diffi- 
culty of the maximum likelihood estimation, because of the untractability of L(n|i/?). Under mild 
conditions on the stationary matrix t/?, Kalbfleisch and Lawless ( 1984[ ) obtained general consistency 
results and asymptotic r(r — 1)— variate normality (in T and N = ^j=i^j(^)) for the estimated 
vector i/'row of entries in '0 written rowwise, i.e. if^mw = ("^i.b • • • :4'i,r-i,4'2,i, ■ ■ ■ ,4'r,r-i)- Lawless 



and McLeish (1984) gave conditions on functions of interest for which the information loss due 



to aggregation is asymptotically negligible with respect to the estimation based on complete se- 



quences. In a specific reliability framework, Gouno et al. (2011) recently provided a methodology 



to estimate such functions of interest (e.g. survival probability, sojourn time in a state) . 

In a Bayesian context, the inference problem can be solved by using a Metropolis-Hastings 
(MH) algorithm to construct a sample of matrices of : i/jM,?/?!^], . . . ,■0^^^, . . ., asymptotically 
drawn from the posterior 7r(t/>|n), by sampling at each step h a candidate vector from a given 
distribution function J(-|t/>[^~^l). The candidate is accepted with probability: 



(9) 



i.e. the acceptance of the candidate is the result of a Bernoulli trial of probability p(i/>['^l*|?/j['^~^l). 
The instrumental density function J(-|i/?t^~^l) allows a random exploration of the space of param- 
eters. The convergence of the chain to the target distribution is proved for any arbitrary func- 
tion J(-|-) which satisfies mild regularity conditions (Robert and Casella, 2004). In the present 
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case, a comfortable instrumental function is the product of r independent Dirichlet distributions 
Dir((ij • ^'), where di is a positive (scalar) constant. This is a usual case of controlled MCMC 



(Andrieu and Thoms, 2008). As the Dirichlet density is null outside the standard (r — l)-simplex, 
all candidates drawn by the instrumental functions automatically respect constraints (|2]). Ob- 
viously the support of J contains the support of the posterior distribution and the chain is a 
reversible jump MCMC. 

It can easily be seen that the mean of each of the r Dirichlet instrumental densities is tj^f' ^\ 
i.e. the candidate matrix is sampled from a probability function which is "centered" on the last 
retained matrix. The variance terms of the covariance matrix, equal to — i^f^f^'^) / {di + 1), 

depend on the shape parameters di which can be interpreted as tuning coefficients that rule the 
distance of exploration from the current state of the MCMC chain to the next proposed one. 

We notice that, as the expressions of the likelihoods ([T]) and ([s]) are formally the same, up to 
a proportionality constant, the MH procedure described above can also be used in the interesting 
case of incomplete sequences when each individual has been observed only once. Such examples 
are treated in the next paragraphs. 

2.5. Real industrial case-study: survey of power station turbines 

In the example shown hereby, a discrete Markov model has been used to describe the propa- 
gation of transverse cracks on steam turbine shafts. This phenomenon has been first observed on 
EDF facilities in late 90 's and since then periodical non-destructive controls are made to measure 



crack depths. For a description of the technical problem and available survey data, see (Garnero 



and Montgomery, 2006). The most important identified explanatory variable is the time spent by 
the turbine in hot shutdown condition. For the purpose of our study, the time has been discretized 
in equally long intervals. Cracks depths are classified in four states si . . . S4 associated to growing 
crack lengths. The modelling of cracks growth by discrete Markov schemes is quite common, e.g. 



dRoh and Xi| |2000D . 

We assume that the process is irreversible, which is physically correct as crack lengths cannot 
decrease. Thus, the transition matrix is upper-triangular and consequently, = 1- 
We made the hypothesis that all turbines are in state si when putting-into-service at the beginning 
of the study. Manufacture and acceptance controls justify this hypothesis. A set of data collected 
between 1998 and 2001 has been analyzed. The data come from 68 turbines from 24 EDF power 
plants. Each turbine is observed only once for a given value of t between 2 and 7. Given the 
uniformity of EDF French generation facilities (same design, operating conditions and maintenance 
policy for all units), we can assume that observed data are i.i.d. 



The results of MCMC estimation, using the Gibbs sampler described in Section 2.2 (second 
half run of 10000 iterations), are shown in Table 1 (left). The application of the MH algorithm 
described above leads to the same results. 

The data set has been enriched between 2001 and 2004 with new crack measures {t between 2 
and 7). 38 turbines among the 68 previously observed were inspected for the second time and two 
for the first time. Some of the collected data are redundant: this happens when for the first and 
the second observation the corresponding times spent in hot shutdown condition fall into the same 
interval. Finally, 17 new exploitable observations can be added to the data set. The estimation of 
transition probabilities gives the results shown in Table 1 (right). 

We can notice that in this case the posterior variance has been very lightly reduced by incor- 
porating the information conveyed by the new data. Given the posterior samples of transition 
probabilities, some quantities of practical interest in industrial reliability have been sampled: the 
unconditional probabilities of the four states, as a function of time, and the expected number of 
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Data set 1 Data set 2 





Mean 


St. Dev. 


95% CI 


Mean 


St. Dev. 


95% CI 




0.637 


0.042 


[0.551, 


0.719] 


0.655 


0.040 


[0.573, 


0.728] 


V'1,2 


0.306 


0.050 


[0.208, 


0.405] 


0.278 


0.049 


[0.185, 


0.374] 


^1,3 


0.044 


0.034 


[0.002, 


0.127] 


0.056 


0.036 


[0.003, 


0.133] 


^1,4 


0.012 


0.011 


[0.000, 


0.041] 


0.012 


0.011 


[0.000, 


0.041] 


^2,2 


0.713 


0.088 


[0.538, 


0.884] 


0.774 


0.075 


[0.636, 


0.921] 


^2,3 


0.250 


0.087 


[0.079, 


0.418] 


0.197 


0.075 


[0.054, 


0.341] 


^2,4 


0.037 


0.032 


[0.001, 


0.119] 


0.029 


0.024 


[0.001, 


0.087] 


^3,3 


0.872 


0.097 


[0.627, 


0.995] 


0.910 


0.071 


[0.730, 


0.996] 


^3,4 


0.128 


0.097 


[0.005, 


0.373] 


0.090 


0.071 


[0.004, 


0.270] 



Table 1: Turbine cracks exaraple. MCMC estimations of transition matrix using the first data set (left, individuals 
observed only one time) and the second data set (right). Here, the bounds of the posterior 95% credibility intervals 
(CI) are the quantiles of probabilities 0.025 and 0.975 respectively. 



steps before the system reaches the absorbing state s^. As S4 can be interpreted as a "failure" 
condition, the expected time to absorption is here the classical MTTF (Mean Time To Failure). 
Notice that here the term "failure" just means that the crack has reached a given length, arbitrarily 
chosen for the purposes of this paper. 

The calculation of state probabilities using Equation ([T| is straightforward. To evaluate the 



MTTF we made use of a well known property of absorbing Markov chains (Grinstead and Laurie 



Snell, 1997, chap. 11). If we consider the matrices: 

/ i'l,! ^1,2 ^1,3 \ 

C = iIj2,2 ■ip2,3 and / 





the matrix I — C has an inverse and each component t* of the row vector 

t* = (i,i,i)-(/-C)-' 

is the expected number of steps before absorption, given that the initial state was Sj. In our case 
the MTTF is then the first component of the vector t*. 

Figure [l] shows the 95% credibility intervals of the predictive state probabilities for discretized 
time t extended up to 15 and the histogram of 5000 samples from the predictive distribution of 
MTTF. Concerning state probabilities, we can notice that pi credibility intervals are narrower 
than other state probabilities as, according to our hypotheses of an irreversible process and initial 
state Si, Pi{t) = "01 1 which mean that the uncertainty over pi only depends on uncertainty over 
ip-j^ ^ (and no other transition probability). The long tail in the MTTF distribution (which is even 
longer than shown in the figure) is due to the high values (close to 1) of the posterior distribution 

of V'3,3- 

Remark. We stress that, even if the data come from real surveys, the study shown hereinbefore 
is given for exemplary purposes only and neither results nor methodology must be extrapolated 
to make any general conclusion about EDF risk assessment policies. 

2. 6. A four- dimensional simulated case study 

The purpose of the previous example was to give a practical use-case of application of the 
estimation algorithms in a poorly informative data context. The next example will deal with more 
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Figure 1: Turbine cracks example. Predictive 95% credibility intervals of state probabilities (left) and predictive 
distribution of the MTTF (right). 



general computational issues. In particular, we will compare the performances of Gibbs and MH 



algorithms in the case where individuals are observed only once. Following a case-study from Lee 



et al. (1968), we consider the following transition matrix: 



/ 0.6 


0.4 








\ 


0.1 


0.5 


0.4 










0.1 


0.7 


0.2 




I 





0.1 


0.9 


/ 



(10) 



First, complete state sequences for m G {10, . . . , 1200} individuals have been generated for 
T = 20 observation periods, under the hypothesis that at t = the initial vector probability 
is (3/4,1/4,0,0). Then, given complete sequences, a single observation per individual has been 
randomly selected, thus obtaining incomplete sequences. Finally, for each m we used the Gibbs 
and the Metropolis-Hastings algorithms described above. The convergence has been checked using 



the Brooks-Gelman statistic (Brooks and Gelman, 1998) computed on three parallel chains and a 



visual inspection of the chains. A classic rule of thumb (RT) is to suppose quasi-stationarity once 



the statistic stably remains under 1.1 (Brooks and Gelman, 1998). The precision in estimation was 



measured using the relative absolute error matrix between the elements of i/?o and a progressive 
Monte Carlo posterior estimate of i/'- In each case, it has been obtained by using the second half 
run of Metropolis-Hastings iterations and Gibbs iterations after the burn-in periods determined 
by Brooks-Gelman RT respectively. Parameters di were sampled uniformly in [100,2500]. For a 
same estimation error of at most 5% per element, the CPU time observed on a 2.8 GHz CPU 
(Xeon) machine before the RT is fulfilled has been plotted in Figure [2] as a function of m. Plots 
are smoothed over 30 repetitions of the algorithms. Clearly, the increasing number of missing data 
makes Gibbs less competitive than MH: after m = 700, conditional sampling of individuals requires 
more CPU time than our basic MH. The number of missing data to be simulated increases linearly 
with the total number m of individuals, as individuals could be observed only once throughout 
their lifespan. This explains the linear behavior of the Gibbs CPU time. 
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The efforts of the practitioner should then concentrate on improving the mixing of Gibbs 
and MH algorithms to diminish their burn-in period. The development of acceleration methods 



has been the subject of a large number of works, reviewed in (Gilks and Roberts 1996 Mira and 



Sargent 


2003 


Gentle 


2004 



in updating multivariate blocks of (often highly correlated) parameters, were shown to be efficient 



to accelerate Gibbs algorithms in conjugate models (Ischwaran and James 2001 Accoto 2009), 



although their implementation often remains case-specific (Sargent et al. , 2000) and can sometimes 



slow the sampler's convergence (Roberts and Sahu, 1997). Alternatively, the multi-move Gibbs 



sampler (Carter and Kohn, 1994), which was developed for Markov switching state-space models, 
proved to be more efficient than the single-move Gibbs sampling, although its filtering aspects 
might be time-consuming ( Friiwirth-Schnatter , 2006, chap. 11, pp. 342-344). More recently. 



cheaper approximations of the Gibbs sampler using best linear predictors have been carried out 
dNott and Kohnl [2005| . 

In the specific case of MH algorithms, the slow mixing and the computational cost are less 
due to the dimension of the problem than the difficulty of eliciting instrumental distributions to 
efficiently explore the parameter space. The remainder of this article is specifically dedicated to 
an empirical exploration of adaptive approaches aiming to improve this feature. 




Figure 2: Case study of Lee et al. (1968). Mean CPU time needed to reach quasi-stationarity (in the sense of the 
Brooks-Gelman rule of thumb) as a function of the number m of individuals (one individual being associated to a 
single true observation). Data have been generated according to the four-state transition matrix ([lO|). 



3. Accelerating the MH algorithm using adaptive approaches 



Heuristically, implementing an adaptive MCMC consists in sequentially tuning the transition 
kernel using the knowledge of past iterations, in an automated way during the simulation, in order 
to improve the mixing rate (Andrieu and Thoms, 2008). In the particular case of our class of MH 
algorithms, this means modifying the product of Dirichlet densities chosen as the instrumental 
distribution J for the MH algorithm introduced in Section 2.4 Each successive instrumental 



distribution is ideally selected such that parallel sampling can explore a large part of the parameter 
space, especially in the first steps of the algorithm. 
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Recently, a rich literature has been dedicated to these approaches, and is especially focused 
on the preservation of the ergodicity of the adaptive chains towards the stationary distribution, 
which is not automatically ensured by automated tunings. Seminal works on this subject are due 



to Roberts and Rosenthal (2007 2009) and Andrieu et al. (Andrieu and Moulines 2006; Andrieu 



and Atchade 2007; Andrieu and Thoms[ 2008). These theoretical works also led to interesting 
software developments (Rosenthal 2007; Vihola 2010). 



Assuming Fj are indices chosen in some collection y based on past algorithm output, we denote 
by Kr^ the transition kernel updating to 



:iii 



Kr, (V', V'O = Pr, (V', V'O Jt^ mi^) + / (1 - Pr. (V', e)) Jr. (e| V) de 6^ (V' 



where 5^ is the Dirac measure in t/? and 

Pr, (V', V'O 



1 A 



Basically, the ergodicity and stationarity properties of an adaptive MH algorithm can be ensured 
if the amount of adapting progressively diminishes, in the sense that the kernel parameters are 
modified by smaller and smaller quantities, or if the probability of adaptation pr^ decreases towards 



zero as i — )■ oo (Roberts and Rosenthal, 2007, Theorem 5). In the framework considered here, such 
adaptations could be based on eliciting vanishing adaptations for the parameters {di)i<i<r. These 
approaches would however be limitative since each di characterizes the marginal distribution of row 
i, hence they do not explore the correlations between the rows. Therefore, the approach proposed 
here focuses on this particular aspect. 

In the following, assuming we are at step h > 1 of the MH algorithm, we propose two ways 
of building an adaptive instrumental distribution i/?'''!* ~ (denoting Jr^ = Jh in the following 
for simplicity) taking advantage of a a— algebra J^h-i generated by the succession of sampled 
parameter matrices 1/''°'; ■ ■ ■ , Both using a (small) fixed number p of basic MH iterations, 

these approaches explore correlations between the rows in the instrumental sampling. 

In our first approach (DCS-MH), we attempt to summarize the correlations within (i/?i, . . . , i/?^) 
by simply capturing the correlations between the diagonal elements of 

In our second method (RCS-MH), we generalize the first method replacing the r— vector of 
diagonal elements by r elements whose position is randomly sampled within each vector i/jj. Doing 
so, we hope to capture more efficiently the dependency between the and accelerate the DCS-MH 
algorithm. 

Diagonal correlated sampling (DCS-MH). 

At iteration p (large enough) : 

1. denote the set of last p non-identical sampled matrices in , . . . , ^1); 

2. for i = 1, . . . , r 

(i) denote ipi^i = {Tpfj, . . . ,ip^j) the p— vector of replicates of the i— t/i-diagonal element; 

(ii) compute u; = Fi{ipi^i) where Fi is the empirical marginal cdf of ipi^i 

3. estimate the Pearson correlation R^^l of (ui,...,Ur); 

4. sample a candidate vector 'j/'diag °^ diagonal elements , . . . , -i/^i^J* using: 
(i) a Gaussian copula, the parameter of which is 
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(ii) Beta marginal distributions Be ^dj • ^j'^^ ^\ di ^1 — ip^^- 
5. for i = 1, . . . , r 

(i) sample i^f^l* , ■ ■ ■ ,^f^tv4%v ■ f^o^: 



[h^i] 



^-'■■'^ f^_ii ■"«)•••) rf,_ii ■ rf,_ii ■"«)•••) r^_ii ' ^ 



(ii) for j^i, renormalize each V'l'^-* multiplying with \ — 'ijjf^* . 



Randomized correlated sampling (RCS-MH). 

At iteration p (large enough) : 

1. same as step 1 in DCS-MH; 

2. sample (with replacement) a r— vector lG{l,...,r} of random indicators; 

3. for i = 1, . . . ,r 

(i) denote tpij. = {tp^^j , . . . ) the p— vector of replicates of the {i,Ii)—th matrix element; 

(ii) compute Uj = Fi{ipij.) where Fi is the empirical marginal cdf of 

4. same as step 3 in DCS-MH; 

5. sample a candidate vector 'j/'rand °^ elements tp^^^*^, . . . ji/^^^j* following the same main idea 
as in DCS-MH method 

6. For i = 1, . . . , r 

(i) sample ^ ■ ■ ■ ^^f!}*-v4j*+v ■ ■ f^om: 

\ , ,lh-l] [h-i] .[h-i] [h-i] I ' 

(ii) for j^i, renormalize each ipf^j* by multiplying with 1 — 'tpf^\* . 



For a more general introduction to copulas, see for instance Nelsen (2006) or Genest and Favre 



(2007), as well as Kim et al. (2007) for more specific issues about copulas fitting. 

In our experiments, we used a Gaussian copula to sample the new diagonal parameters, mainly 
because of its symmetric properties and its simplicity of calibration using a correlation matrix R 
(Marshall and Olkin, 1988). Note that one has to consider and check up with great care the p 
previously simulated matrices {i/j'^I, . . . , i/j'^I} to make sure that a robust empirical estimator of 
R can be defined, in the sense that its Cholesky decomposition is numerically stable during the 
sampling process (Marshall and Olkin, 1988). The condition number can be used to do so (El 



Ghaoui, 2002). Conditionally on correlated sampled parameters, Dirichlet distributions appear 



necessary to get coherent instrumental sampling of remaining elements within each row vector 
m ■ 
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Theoretical behavior. Despite the large amount of existing work aiming to simplify the conditions 



ensuring ergodicity and stationarity of the target distribution (Nott and Kohn 2005; Roberts and 



Rosenthal, 2007, 2009 Atchade et al. , 2011), theoretical descriptions of kernels based on Dirichlet 
products compounded with Gaussian copulas turn out to be technically complex, and their study 
deserves a specific work which remains outside the scope of this article. Since our primary aim is 
to assess the interest of exploring the correlations between the rows of -0, we adopt the simplest 



approach of a finite sampling scheme when choosing J, as proposed by [Roberts and Rosenthal 
(2007): given a time r < oo, Jr„ = Jr^ for any n > r. This approach is carried out in this paper 
at each sweep of the algorithm after a given mixing period, selecting the final Jr^ as the basic 
product of Dirichlet's described hereinbefore. In substance, r has the sense of an exploration time, 
and in practice is selected as the minimum time between the time required for a fixed number of 
iterations and the time until the Brooks-Gelman RT is fulfilled. 

Nonetheless, this explorative study fits into recent schemes shared by several authors, who 
tested copula-based methods to improve the efficiency of their sampling algorithms. In their 
seminal work on the optimization of the adaptation, Haario et al. (2001) considered Gaussian 



copula instrumental distributions calibrated over the full past of the chains. 



Thoms (2008) for a review of this particular major field of adaptive MCMC. Strid et al. (2010) 



See 


Andrieu and 


Strid et al. 


(2010) 



used the sampling history to continuously calibrate a t— copula proposal distribution, in order 



to sample from dynamic stochastic equilibrium models. Finally, Grain (2011) used products of 
bivariate copulas to tune MGMG during an initialization period only, in the same spirit as the 
finite sampling approach used in the present paper. 



Illustration. Gontinuing the four-dimensional simulated example from Section 2^, we applied the 
DGS-MH and RGS-MH methods with p = 30, still augmenting the number m of individuals 
and using three parallel chains per experiment. Parameters di remain similarly sampled at each 
iteration. Results are smoothed over 50 similar runs of algorithms. The comparison of Gibbs and 
MH burn-in periods in Figure [3} in the sense of the Brooks-Gelman RT, illustrates the improvement 
yielded by RGS-MH. On the other hand, in this case DGS-MH performs worse than basic MH and 
even Gibbs sampler. As we could expect, RGS-MH does clearly better than DGS-MH because of 
its widest exploration of the parameter space. RGS-MH strongly beats Gibbs even for relative low 
numbers of individuals. 

The poor performance of DGS-MH is due to the computational cost of the selection of past 
matrices {i/''^^ • • • sufficiently different to allow for a robust Gholesky inversion. This cost 

clearly increases with the progression towards stationarity since sampled matrices become more and 
more similar and many among them must be rejected in the calibration task of the instrumental 
distribution. The RGS-MH algorithm suffers of course from the same defect, but the much better 
mixing counterbalances the increase of the computational cost, with respect to the basic MH 
algorithm, in a significant way. 



4. Numerical experiments 

This section deals with simulation studies to test the potentialities of our adaptive proposals to 
a wide class of transition matrices commonly encountered in reliability and risk assessment (RRA). 
In RRA, it often occurs that the degradation of a system E is described using r separated states (for 
instance defined by a scale of crack sizes), ordered from minor defect to major failure (replacement 
cause). To be conservative, one may assume that potential repairs following a running failure 
are, at best, as bad as old, namely S remains in the same state than before the failure. In other 
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Number m of individuals 



Figure 3: Case study of Lee et al. (19681. Mean CPU time needed to reach quasi-stationarity as a function of the 



number m of individuals (same simulations as Figure [2]). With respect to the Figure [2j Gibbs and basic MH are 
also compared to DCS-MH and RCS-MH algorithms. 



cases, one might assume these repairs bring actually more complications than real improvement 
(for instance if S is old), so that S is more deteriorated after the repair than before {worth than 



old repair). See Basile et al. (2007) for more details about these notions. Under a stationarity 



assumption, the transition matrix t/? is necessarily upper triangular, with i/j^ = (0, . . . , 0, 1). 

Simulation features. In the following experiments, we test the potentialities of Gibbs and the three 
MH algorithms described hereinbefore (basic, DCS-MH and RCS-MH) as a function of r. We vary 
the dimension r between 2 and r^ax (in practice, we consider rmax = 6 to remain realistic). To 
start with, we need a rule to sample realistic matrices with decreasing dimension: 

1. denote ^JJ^^^ a r x r upper triangular matrix. 

2. create tp^'''^^'^ matrix as follows: for i = 1, . . . , r — 1, 

= V'lj for j = 1, . . . , r - 2 

and 

,{r-l) ,{r) , ,(r) 

Doing so we automatically ensure that i/'r-~L^'' — (O5 • • • 5 O5 !)• The rationale for this construction is 
obviously to increase the probability of a major failure event when simplifying the model. Thus we 
simply need to sample tj)^"^^^^) to get all other matrices considered for simulation tests. Pursuing 
our wish of realism, we assume that worth than old repairs are less probable than as had as old 
ones. Therefore, for z = 1, . . . , rmax — 2 and A; = 1, . . . , r^ax — i — 1, we assume in the sampling: 

^max ^ 

/ (j^max) ^ / (rmax) ^ \ / (»'max) 

p=k+l 
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and especially for i = rmax — 1, r i > r to ensure a constant decreasing of 

i-iicL-v 7 r / niax J-?' max -L ' max -L,' max 

values 'i/'j- o "^i^+i; • • • ; "^ir^ ^ — ''^max- Finally, we selected matrices for which: 

I (j'max) ^ / (^max) 

This models the following case: the closer to a major failure state, the better (the more cautious) 
the repair. Notice that we do not take into account any of our simulation constraints in the following 
estimation procedures, except the presence of zeros beneath the diagonal of t/? (by reducing the 
length of Dirichlet distributed vectors in the instrumental sampling). We consider it as a minimal 



knowledge assumable in real case-studies (e.g. Section 2.5). Finally, per simulated matrix, a 
complete sequence for m = 1000 individuals was generated for T = 20 observation times. As we 
are always in the particular case of "a single observation per individual", only one observation is 
retained in each sequence for the inference exercise. 



Estimation. As in Section 2.6, each experiment for a given r G [3, rmax = 6] consists in running 
three parallel chains for each method and monitoring them using the Brooks-Gelman statistic. 
Relative Euclidian errors on posterior means of matrix components (computed using 1000 iterations 
after a burn-in period determined by the Brooks-Gelman RT) are fixed at most at 5%, involving 
preliminary tests for fixing the total number of iterations. Again, parameters rfj are sampled 
uniformly in [100,2500]. Finally, each experiment is repeated 100 times to average the results 
(each time a new family of matrices i/;(^max)^ _ _ _ ^ ^(3) i^gj^g simulated). 

Results. Boxplots and mean CPU times before quasi-stationarity (in the sense of the Brooks- 
Gelman RT) are plotted in Figures |4] and [5] Results obtained on the simulated example from 
Section 2.6| can be generalized: RCS-MH provides for all dimensions a significant improvement in 



mixing. Similar results have been obtained when carrying out an empirical approach to calibrate 
the mean acceptance rate to a standard nominal value of 50% then 25%. 



5. Discussion 



5.1. Main ideas and results 

This article first aims to provide a general review and technical advises about the Bayesian 
estimation of finite-state transition matrices tj) in discrete Markovian models under various missing 
data schemes, which appear to be of particular interest in several domains, especially in engineering. 
Actually, reliability practitioners may frequently deal with classes of upper-triangular transition 
matrices that have been chosen for most of the experiments presented here. Depending on the 
nature of available data, the practitioner may have to choose between Gibbs or Metropolis-Hastings 
(MH) algorithms. The time-consuming features of these algorithms, depending on the size of 
missing data and the dimension of the problem, appear as limiting factors in practice. Therefore, 
the second part of this article focuses on a first exploration of two adaptive mechanisms (DCS-MH 
and RCS-MH) likely to accelerate the MH algorithms. 

Numerical experiments have highlighted, on this specific class of examples, that using instru- 
mental distributions based on Gaussian copulas to account for the correlations between the rows of 
i/? yields a better mixing of the chains, implying a significant reduction of the computational cost. 
The gap with basic MH strategies, based on the independent sampling of the rows of if), increases 
with the number m of individuals or the number r of states. The simplicity of the approaches pro- 
posed here lets us think that any practitioner dealing with aggregate data could easily implement 
the DCS-MH and RCS-MH mechanisms and reduce the computational time. 
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Figure 4: RRA case study. Boxplots of CPU times needed to reach quasi-stationarity as a function of the dimension 
r. Half hues indicate median and bounds indicate most extreme values. Data have been generated by upper- 
triangular transition matrices. 



Supplementary experiments have highlighted that the CPU time can be still diminished by using 
two "coarse" versions of the DCS-MH and RCS-MH mechanisms. They consist in estimating the 
copula parameter R directly from the Pearson correlation of the matrix elements, namely removing 
the step 2.(ii) in each mechanism. These coarse approaches (we call them DCS-C-MH and RCS- 
C-MH) have been be compared to the previous ones in Figure |6] Here, the difference in CPU time 
is mainly due to the cost of empirical inversions in the DCS-MH and RCS-MH methods. 

The adaptive schemes proposed here (especially the most powerful RCS-MH and RCS-C-MH), 
which remain only empirically studied, deserve a more specific study from both theoretical and 
applied viewpoints. This point is more widely discussed in the following subsection. 

As a take-home message, in the most general case of incomplete data problems with several 
observations per individual, the Gibbs sampler based on the data augmentation technique seems 
to be the only possible alternative. In the particular case of a single observation per individual, 
the adaptive MH algorithms (and especially RCS-MH) are valid alternatives to the Gibbs sampler 
if the number of individuals is greater than a few hundred, say 200, and the number of states is 
greater than three. In low dimensional problems (two or three) the practical interest of adaptive 
MH methods, with respect to the simpler Gibbs sampler, is less obvious. 

5.2. Directions of further research 

The adaptation processes proposed here remain empirical, and theoretical studies are needed 
to build copula-based strategies ensuring the ergodicity and the stationarity of the chains less 



crudely than imposing a finite adaptation time, based on principles initiated by [Roberts and 
Rosenthal (2007) and Andrieu and Moulines (2006). Indeed, fully adaptive MCMC should be 



build on infinite adaptations which continuously modify the choice of the transition kernel using 
the past values of i/' along the chains, quasi-stationarity occurring when these kernel modifications 
become imperceptible. These adaptations should be led on both correlated and marginal features 
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Figure 5: RRA case study. Mean CPU times needed to reach quasi-stationarity. 



of the matrix elements. To this first aim, future studies could focus on the mechanism of state 



permutation, inspired by similar ones carried out in the framework of variable selection (Nott and 



Kohn, 2005), and on removing the strong assumption made by using a Gaussian copula to model 
correlations within the elements of i/?. This choice can appear oversimplified since it does not take 
into account possible correlations between extreme values in the instrumental distribution of i/?. 
Therefore a copula selection procedure should be carried out at different times of the adaptive 
chain, for instance using frequentist tests (e.g. Cramer- von Mises), based on distances between 



estimated and simulated copulas (Genest and Quessy 2006 Nikoloulopoulos and Karlis 2008) 



or Bayesian posterior odds (Huard et al. 2006). As those procedures remain time-consuming in 
dimensions r > 2, this approach was not implemented here in this exploratory work. 

Furthermore, it is necessary that such more sophisticated adaptive Metropolis-Hastings algo- 



rithms be compared in practice to refined Gibbs algorithms, evoked at the end of Section 2.6, that 
could benefit from the stick-breaking properties of Dirichlet distributions. 

Another point of interest could be the adaptation of the methods reviewed here to the case of 
non stationary Markov chains. A simple way for doing this could be to stratify the data on the 
time t or on groups of values of t (Urabake et al. , 1975; Sendi et al. , 1999). The use of logit or 



proportional odds models (Cole et al. , 2005 Grimshaw and Alexander, 2011) to include also the 



effect of additional covariates is another perspective for this work. 
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Figure 6: RRA case study. Boxplots of CPU times needed to reach quasi-stationarity as a function of the dimension 
r. Half hues indicate median and bounds indicate most extreme values. Data have been generated by upper- 
triangular transition matrices. The DCS-C-MH and RCS-C-MH abbrevations indicate two "coarse" versions of the 
DCS-MH and RCS-MH mechanisms. 



Appendix A. Proof of expression ([7]) 

The probability of a given sequence of T -|- 1 states {t from to T) with only one observed state 
at t = tk can be written as the sum of T sums fits into each other (with index from 1 to r), one 
for each unobserved state: 

P(«, Sj, •) is equal to 



In the expression above the sum of the first sums (indices from iq to itk-i) is the unconditional 
probability Pj{tk) for the system to be in state j at time t^. That can be showed by developing 
the recursive formula ([T]): 



^ — ^ai 

= ^ Yl (^fe - 2) ■ ^g,,g, ■ i^g,,j 

^ — ^91 ^ — ^92 

= 5Z 5Z 5Z (^fc - 3) • ^33,92 ■ ^92,91 ■ ^9ii 

^ — '91 ^ — '92 ^ — 



and renaming the index gi,g2,g3, ... as it^-i,iti,-2,hk-3: ■ ■ ■■ The sum of the remaining sums in 
(??) is one as ipi,j = 1- The probability of the sequence is then Pj{tk)- Thus, the likelihood of 
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m incomplete sequences where each individual is observed only once can be written: 

m r T r T 

k=l j=l t=0 j=l t=0 

which is, under the hypothesis that probabilities Pj{0) are know, the expression ([7| up to constant 
of proportionality. 

Appendix B. Gibbs sampler for MNAR data when missingness only depends on the 
actual state 

First initialize the algorithm by arbitrarily completing state sequences. Then at each step 
h = 1,2, . . ., perform the following two-step procedure: 

1. parameters estimation: 

~ Dir(7a + ^!r',-.7.. + <-^') 

and 

vf'V-'^ - Be(a. + ap-^l,A + 6f-^]), 

tt-ll ^ ^ \h-l\ \h-l\ 

where a = Z] Z] 1/ C'-ii i\' ^ = 1/ 1^-11 n\ ^^^^ w\, are the 

same as in Section 12. 2[ 

2. data augmentation: drawing z^^^^^^^^-^ conditional on the following probabilities: 

P(?/!2i) = ^.bfc;'=^.,V'[^r7['^l) oc r^fl-^S, for t = 1, 



^{y^!!,T) = sM;},T-l) = s^.^l^^'Kl^'^) oc ^f-i,fl fort = T 

and 

P(^|:i, = ^. 1^1.-1) =^n,y!::.i) = ^.,V'f^^ « ^-i^r^. otherwise. 
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